5-B. Sentiment Analysis: VADER

In [1]:
# pip install tweepy==4.1.0
# pip install vaderSentiment
In [2]:
import os
import time
import math
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import multiprocessing
from pandarallel import pandarallel
import requests
import sys

import nltk
from textblob import TextBlob
from wordcloud import WordCloud
from google.cloud import storage
from textblob.sentiments import NaiveBayesAnalyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pytz

import spacy
from collections import Counter
import concurrent.futures

import warnings

warnings.simplefilter('once')
warnings.simplefilter('ignore')
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
In [3]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)
In [4]:
num_processors = multiprocessing.cpu_count()
num_processors

workers = num_processors-1

print(f'Using {workers} workers')
Using 15 workers
In [5]:
pandarallel.initialize(nb_workers=workers, use_memory_fs=False)
INFO: Pandarallel will run on 15 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

1. Import Data

In [6]:
%%time

file_path = 'news_cleaned.parquet'
news = pd.read_parquet(file_path)
CPU times: user 20.1 s, sys: 26.6 s, total: 46.7 s
Wall time: 26.3 s
In [7]:
news = news.reset_index(drop = True)
In [8]:
news.shape # (198064, 16)
Out[8]:
(198064, 16)
In [9]:
news.columns
Out[9]:
Index(['url', 'date', 'language', 'title', 'text', 'year', 'month', 'day',
       'text_ner', 'text_cleaned', 'text_lemm', 'title_ner', 'title_cleaned',
       'title_lemm', 'title_word_count', 'text_word_count'],
      dtype='object')
In [10]:
news.sample(1, random_state = 42)[['text_ner', 'text_cleaned', 'text_lemm', 'title_ner', 'title_cleaned', 'title_lemm']]
Out[10]:
text_ner text_cleaned text_lemm title_ner title_cleaned title_lemm
196666 Prosecutors in all states urge Congress to strengthen tools to fight AI child sexual abuse images Skip to contentCommunity Coverage TourHome ProMedically SpeakingBest of the WestChampions in AgBack to Our AppsCOVID 19Food for NewsTexasNew to a TipLatest CamsClosings and DelaysSend Us Your Weather PhotosTxDOT Highway ConditionsDownload the Weather AppWeather ResourcesKCBD InvestigatesSubmit a TipChad Read ShootingReagor Dykes CoverageSex Trafficking on the South PlainsLubbock County Medical E... prosecutors states urge congress strengthen tools fight ai child sexual abuse images skip contentcommunity coverage tourhome promedically speakingbest westchampions agback appscovid newstexasnew tiplatest camsclosings delayssend us weather photostxdot highway conditionsdownload weather appweather resourceskcbd investigatessubmit tipchad read shootingreagor dykes coveragesex trafficking south plainslubbock county medical examiner school beat petestats predictionshow watchcommunitytell somethi... prosecutor state urge congress strengthen tool fight ai child sexual abuse image skip contentcommunity coverage tourhome promedically speakingbest westchampions agback appscovid newstexasnew tiplatest camsclosings delayssend u weather photostxdot highway conditionsdownload weather appweather resourceskcbd investigatessubmit tipchad read shootingreagor dyke coveragesex traffic south plainslubbock county medical examiner school beat petestats predictionshow watchcommunitytell something goodnot... Prosecutors in all states urge Congress to strengthen tools to fight AI child sexual abuse images prosecutors states urge congress strengthen tools fight ai child sexual abuse images prosecutor state urge congress strengthen tool fight ai child sexual abuse image

2. Sentiment analysis with VADER: Positive, Neutral, Negative and Compound

Utilize a Sentiment Dictionary to decipher the sentiment of text

A sentiment dictionary is the mapping of words to sentiment values. For example: the word awesome (which is a positive sentiment) could have a value of +3.7 and the word horrible (which is a negative sentiment) could have a value of -3.1. While using a sentiment dictionary, the values of the sentiment words are summed to get the overall sentiment of the text.

For example: I loved the ambience of the restaurant but the drive to the restaurant was horrendous. Overall, it was a good evening.

Now let's say the value of the word loved is +3.9, the value of the word horrendous is -4.2, and the value of the word good is +2.9. The values sum to +2.6, so the overall sentiment of the text is positive.
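
The dictionary approach described above can be sketched in a few lines of plain Python. This is a minimal illustration, not VADER itself: `toy_lexicon` holds only the three example values from the text, and `score_text` is a hypothetical helper name.

```python
import re

# Toy lexicon using the example values from the text (an assumption for
# illustration; real sentiment dictionaries hold thousands of entries).
toy_lexicon = {"loved": 3.9, "horrendous": -4.2, "good": 2.9}

def score_text(text, lexicon):
    """Sum the lexicon values of every word in the text that has one."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0.0) for word in words)

review = ("I loved the ambience of the restaurant but the drive to the "
          "restaurant was horrendous. Overall, it was a good evening.")

total = score_text(review, toy_lexicon)
label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
print(round(total, 1), label)  # prints: 2.6 positive
```

Words missing from the lexicon contribute nothing to the sum, so only the three sentiment-bearing words decide the result.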

VADER stands for Valence Aware Dictionary and sEntiment Reasoner. Its lexicon was tuned for social media text such as tweets and includes emoticons and slang. It also handles sentiment intensifiers (such as "incredibly" in "incredibly funny") and negations (such as "not bad", which scores as a slight positive sentiment).

How does it work? VADER checks each word of a text against its lexicon and produces four sentiment metrics from the word ratings: positive, neutral, negative, and compound. The positive, neutral, and negative scores are the proportions of the text that fall into each category. The compound score is the sum of the lexicon ratings, adjusted by VADER's rules and then normalized to a range between -1 and 1.
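
To make the standardization step concrete: the open-source vaderSentiment implementation maps a summed rating x to x / sqrt(x^2 + alpha), with alpha = 15. The sketch below reimplements only that step in plain Python. The function name `normalize_sum` is ours, and the rule-based adjustments VADER applies before summing are left out.

```python
import math

def normalize_sum(score, alpha=15.0):
    """Squash an unbounded summed rating into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, norm))

# A short text with a small summed rating lands well inside the range,
# while a long text with a large summed rating saturates near 1.
short_text_score = normalize_sum(2.6)
long_text_score = normalize_sum(100.0)
print(round(short_text_score, 4), round(long_text_score, 4))
```

Because the sum grows with the number of sentiment-bearing words, long documents such as full news articles tend to be pushed toward the extremes of the range.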

In [11]:
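import nltk
# Note: this download serves nltk.sentiment.vader; the vaderSentiment package imported above ships with its own copy of the lexicon
nltk.download('vader_lexicon')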
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
Out[11]:
True
In [12]:
# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()

Which text to input for sentiment analysis?

For sentiment analysis of news articles about AI and job market changes, the choice between the text preprocessed for NER (Named Entity Recognition), the moderately cleaned text prepared for LDA (Latent Dirichlet Allocation), and the lemmatized text depends on the sentiment analysis tool and on what counts as a good result. A general guide:

Minimally Cleaned Data (NER):

  • Pros: Retains more context and original structure, which can be helpful for capturing sentiment related to specific entities and nuances in the text.
  • Cons: May include noise that could potentially skew sentiment analysis results (e.g., irrelevant punctuation, capitalization, or rare words).
  • Best for: Tools that are good at handling complex language nuances, including capitalization and entity-based sentiments (e.g., VADER, which understands capitalization as emphasis).

Moderately Cleaned Data (LDA):

  • Pros: Removes stopwords and lowercases text, which could help in focusing on the more meaningful words that might carry sentiment.
  • Cons: Some sentiment-bearing terms, especially intensifiers or negations, might be lost if not handled properly.
  • Best for: Traditional sentiment analysis approaches that don’t handle entity recognition and rely more on the overall frequency and presence of sentiment-bearing words.

Lemmatized Text:

  • Pros: Normalizes words to their base form, which can be beneficial for consistency and possibly reducing the feature space.
  • Cons: Lemmatization might alter the meaning of some words, losing the sentiment in the process (e.g., changing "better" to "good" could affect the sentiment score).
  • Best for: Sentiment analysis tools or models that are trained on lemmatized data or when consistency of word forms is crucial.
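
One of the cons listed above, losing negations during cleaning, is easy to reproduce. The sketch below uses a tiny hypothetical stopword set (`toy_stopwords`) that, like many standard English stopword lists, contains "not". It is not the cleaning pipeline that produced `text_cleaned`.

```python
# Hypothetical stopword set for illustration; many standard English
# stopword lists include negations such as "not" and "no".
toy_stopwords = {"the", "was", "is", "a", "not", "no"}

def remove_stopwords(text, stopwords):
    """Lowercase the text and drop every token found in the stopword set."""
    return " ".join(tok for tok in text.lower().split() if tok not in stopwords)

original = "The service was not bad"
cleaned = remove_stopwords(original, toy_stopwords)
print(cleaned)  # prints: service bad
```

A lexicon-based scorer that sees only "service bad" cannot recover the mildly positive "not bad", which is one reason to check how a cleaned column was produced before scoring it.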
In [13]:
len(news['text_cleaned'])
Out[13]:
198064
In [14]:
%%time

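# Apply VADER to each article; the result is a Series of score dicts (neg, neu, pos, compound)
# Note: 'text_cleaned' is lowercased with stopwords removed, so VADER's capitalization cues (and possibly some negations) are not available here
sentiments = news['text_cleaned'].parallel_apply(lambda x: sid.polarity_scores(x))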
CPU times: user 1.73 s, sys: 5.23 s, total: 6.96 s
Wall time: 24min 10s
In [15]:
# Convert the result into a DataFrame
df_sentiments = pd.DataFrame(sentiments.tolist())
In [16]:
df_sentiments.isnull().sum()
Out[16]:
neg         0
neu         0
pos         0
compound    0
dtype: int64
In [17]:
# Label a row by its compound score: positive (> 0), negative (< 0), or neutral (exactly 0)
def label_sentiment(row):
    if row['compound'] > 0:
        return 'positive'
    elif row['compound'] < 0:
        return 'negative'
    else:
        return 'neutral'
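
The function above treats only an exact compound score of 0 as neutral. The vaderSentiment documentation suggests a wider neutral band, with thresholds at +0.05 and -0.05. The sketch below shows that variant as a standalone function; `label_with_band` and its `band` parameter are illustrative names, and this variant is not what produced the results in this notebook.

```python
def label_with_band(compound, band=0.05):
    """Label a compound score, treating scores within +/-band as neutral."""
    if compound >= band:
        return 'positive'
    if compound <= -band:
        return 'negative'
    return 'neutral'

# Scores taken from the ranges seen in this notebook's outputs.
examples = [0.9988, -0.1725, 0.0015, -0.0018, 0.0]
labels = [label_with_band(score) for score in examples]
print(labels)
```

With the band in place, the two near-zero scores move from positive and negative to neutral, while the clearly signed scores keep their labels.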
In [18]:
%%time

# Apply the function to determine Positive, Negative, or Neutral
df_sentiments['sentiment'] = df_sentiments.parallel_apply(label_sentiment, axis=1)
CPU times: user 49.1 ms, sys: 2.1 s, total: 2.15 s
Wall time: 2.37 s
In [19]:
# Keep only the 'sentiment' and 'compound' columns
df_results = df_sentiments[['sentiment', 'compound']]
In [20]:
# Add the results back to the original DataFrame
news['vader_sent'] = df_results['sentiment']
news['vader_comp'] = df_results['compound']
In [21]:
news[news['vader_sent'] == 'positive'][['text_ner', 'vader_sent', 'vader_comp']].sample(3, random_state = 42)
Out[21]:
text_ner vader_sent vader_comp
29974 OpenAI has its tentacles in hundreds of companies. Here s how it s making them more productive. HOME MAIL NEWS FINANCE SPORTS ENTERTAINMENT LIFE SEARCH SHOPPING YAHOO PLUS MORE ... Yahoo Finance Yahoo Finance Sign in Mail Sign in to view your mail Finance Watchlists My Portfolio Crypto Yahoo Finance Plus Dashboard Research Reports Investment Ideas Community Insights Webinars Blog News Latest News Yahoo Finance Originals Stock Market News Earnings Politics Economic News Morning Brief Personal... positive 0.9988
124108 Bright Direction Dental Selects Overjet AI to Elevate Patient Care Skip to Florida WeekendWatch LiveWatch GuideSouth Florida WeekendSportsAbout UsContact UsNextGen TVProgramming ScheduleLatest Country Music LifestyleGray DC BureauInvestigate TVPress ReleasesBright Direction Dental Selects Overjet AI to Elevate Patient CarePublished Jan., at AM EST Updated hours agoThe DSO embraced technological innovation and partnered with Overjet for AI powered radiograph analysis, clinical insights, and o... positive 0.9989
36914 Ai Regulation Regulators dust off rule books to tackle generative AI like ChatGPT, ET BrandEquity X We use cookies to ensure best experience for you We use cookies and other tracking technologies to improve your browsing experience on our site, show personalize content and targeted ads, analyze site traffic, and understand where our audience is coming from. You can also read our privacy policy, We use cookies to ensure the best experience for you on our website. By choosing I accept, or by c... positive 0.9976
In [22]:
news[news['vader_sent'] == 'negative'][['text_ner', 'vader_sent', 'vader_comp']].sample(3, random_state = 42)
Out[22]:
text_ner vader_sent vader_comp
72547 Ex Florida data scientist turns herself in after arrest warrant issued Skip to content Go Local Grow with Us Expert Connections Health Connections Contests Moms Talk Baby Boomers Talk Panhandle Deals Viewers Choice Awards Home News WATCH LIVE Weather Closings Coronavirus Vaccine Watch Community Sports About Us Home Election Results Download our Apps WATCH LIVE Go Local News National Crime Education Perspective with Brent McClure Good News With Doppler Dave Coronavirus Vaccine Watch Panhandle... negative -0.9840
105006 Flood forecasts in real time with block by block data could save lives a new machine learning method makes it possible Skip to main content MySA Homepage Currently Reading Flood forecasts in real time with block by block data could save lives a new machine learning method makes it possible Newsletters Sign In HomeSubscribeBuy E N MerchandiseContact UsAbout UsAdvertise With UsPlace a Classified AdPrivacy NoticeNewsletters Text AlertsFind a Business in S.A.Manage by to San AntonioClassified Ma... negative -0.1725
107850 Musk, scientists call for halt to AI race sparked by ChatGPT Skip to contentTornado Disaster InfoWhat s Your Home ShowWeatherWeather MapsRadarWeather BlogWeather AcademyWeather RadioSevere Weather ScoresBeat the AceTeam of the WeekAaron s AcesCheerleader ChallengeCommunity MapGood Morning ArkLaMissGuest RecipesGuest Interview Request FormHealth ConnectionsPerfect HomeOur TownService SaluteSubmit Photos and VideosFeed Your SoulRecommend Your Favorite RestaurantMr. FoodTalking FoodTV ListingsS... negative -0.2247
In [23]:
news.isnull().sum()
Out[23]:
url                 0
date                0
language            0
title               0
text                0
year                0
month               0
day                 0
text_ner            0
text_cleaned        0
text_lemm           0
title_ner           0
title_cleaned       0
title_lemm          0
title_word_count    0
text_word_count     0
vader_sent          0
vader_comp          0
dtype: int64
In [24]:
news.to_parquet('news_vader_sent.parquet')
In [25]:
# Google Cloud Storage details
bucket_name = 'nlp-final'
file_path = 'news_vader_sent.parquet'  # This is the name the file will have in GCS
local_file_path = 'news_vader_sent.parquet'  # Path to the local file you just saved

# Create a GCS Client
storage_client = storage.Client()

# Get the bucket
bucket = storage_client.get_bucket(bucket_name)

# Create a blob object from the filepath
blob = bucket.blob(file_path)

# Upload the file
blob.upload_from_filename(local_file_path)

3-(A). Sentiment over time: Compound Score

3.1. Overall Sentiment (Average of Sentiment from Positive and Negative)

1. Sentiment Distribution

In [26]:
sentiment_counts = news['vader_sent'].value_counts(ascending=False).reset_index()
sentiment_counts.columns = ['Sentiment', 'Count']
sentiment_counts
Out[26]:
Sentiment Count
0 positive 187561
1 negative 9947
2 neutral 556
In [27]:
# Create a bar plot
plt.figure(figsize=(7, 5))
sns.barplot(x='Sentiment', y='Count', data=sentiment_counts)

# Adding title and labels
plt.title('Sentiment Distribution from VADER Analysis')
plt.xlabel('Sentiment')
plt.ylabel('Count')

# Show the plot
plt.show()
[Figure: Sentiment Distribution from VADER Analysis]
In [28]:
# Compute the absolute values of the vader_comp scores
abs_vader_comp = abs(news['vader_comp'])
In [29]:
abs_vader_comp.describe()
Out[29]:
count    198064.000000
mean          0.971577
std           0.112334
min           0.000000
25%           0.993000
50%           0.997700
75%           0.999100
max           1.000000
Name: vader_comp, dtype: float64
In [30]:
# Plot a histogram with a KDE overlay (sns.distplot is deprecated in recent seaborn releases)
plt.figure(figsize=(7, 5))  # Set the size of the plot
sns.histplot(abs_vader_comp, bins=30, kde=True, stat='density', edgecolor='black')

# Customize the plot
plt.title('Distribution of Absolute VADER Compound Scores')
plt.xlabel('Absolute Compound Score')
plt.ylabel('Density')

plt.show()
[Figure: Distribution of Absolute VADER Compound Scores]
In [31]:
# Take the signed (raw) compound scores
vader_comp = news['vader_comp']
In [32]:
vader_comp.describe()
Out[32]:
count    198064.000000
mean          0.887799
std           0.410358
min          -1.000000
25%           0.992500
50%           0.997700
75%           0.999000
max           1.000000
Name: vader_comp, dtype: float64
In [33]:
# Plot a histogram with a KDE overlay (sns.distplot is deprecated in recent seaborn releases)
plt.figure(figsize=(7, 5))  # Set the size of the plot
sns.histplot(vader_comp, bins=30, kde=True, stat='density', edgecolor='black')

# Customize the plot
plt.title('Distribution of VADER Compound Scores')
plt.xlabel('Compound Score')
plt.ylabel('Density')

plt.show()
[Figure: Distribution of VADER Compound Scores]

2. Sentiment Over Time

Year

In [34]:
# Group by year and calculate the average sentiment score for each year
yearly_sentiment = news.groupby('year')['vader_comp'].mean().reset_index()
yearly_sentiment.columns = ['Year', 'Average_Sentiment']
In [35]:
yearly_sentiment.head()
Out[35]:
Year Average_Sentiment
0 2020 0.877180
1 2021 0.900908
2 2022 0.914087
3 2023 0.877717
In [36]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=yearly_sentiment, x='Year', y='Average_Sentiment', marker='o')

# Customize the plot
plt.title('Yearly Average Sentiment Trend', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(yearly_sentiment['Year'])  # Ensure all years are shown as x-ticks

# Show the plot
plt.show()
[Figure: Yearly Average Sentiment Trend]
In [37]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news, x='year', y='vader_comp', marker='o')

# Customize the plot
plt.title('Yearly Average Sentiment Over Time', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Yearly Average Sentiment Over Time]

Month

In [38]:
# Group by year and month, and calculate the average sentiment score for each month
monthly_sentiment = news.groupby(['year', 'month'])['vader_comp'].mean().reset_index()
monthly_sentiment.columns = ['Year', 'Month', 'Average_Sentiment']
In [39]:
monthly_sentiment.head()
Out[39]:
Year Month Average_Sentiment
0 2020 1 0.894119
1 2020 2 0.888097
2 2020 3 0.925711
3 2020 4 0.940956
4 2020 5 0.923580
In [40]:
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Create a larger figure size to prevent overlapping
plt.figure(figsize=(20, 10))

# Create a line plot with the custom color palette
sns.lineplot(x='Month', y='Average_Sentiment', hue='Year', data=monthly_sentiment, marker='o', palette=custom_colors)

# Add titles and labels
plt.title('Monthly Average Sentiment Over Time')
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])  # Month labels from 1 to 12

# Move the legend outside of the plot
plt.legend(title='Year', bbox_to_anchor=(1.02, 1.02), loc='upper left')

# Adjust subplot parameters for better layout
# plt.subplots_adjust(right=0.8)

# Show the plot
plt.show()
[Figure: Monthly Average Sentiment Over Time, one line per year]
In [41]:
monthly_sentiment['Year_Month'] = monthly_sentiment['Year'].astype(str).str.zfill(2) + '-' + monthly_sentiment['Month'].astype(str).str.zfill(2)
In [42]:
monthly_sentiment.head()
Out[42]:
Year Month Average_Sentiment Year_Month
0 2020 1 0.894119 2020-01
1 2020 2 0.888097 2020-02
2 2020 3 0.925711 2020-03
3 2020 4 0.940956 2020-04
4 2020 5 0.923580 2020-05
In [43]:
# Convert 'Year_Month' to a datetime format for better plotting
monthly_sentiment['Year_Month'] = pd.to_datetime(monthly_sentiment['Year_Month'])
In [44]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=monthly_sentiment, x='Year_Month', y='Average_Sentiment', marker='o')

# Customize the plot
plt.title('Monthly Average Sentiment Over Time', fontsize=16)
plt.xlabel('Year-Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Monthly Average Sentiment Over Time]
In [45]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news, x='month', y='vader_comp', marker='o')

# Customize the plot
plt.title('Monthly Average Sentiment', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Monthly Average Sentiment]

Day

In [46]:
daily_sentiment = news.groupby(['year', 'month', 'day'])['vader_comp'].mean().reset_index()
daily_sentiment.columns = ['Year', 'Month', 'Day', 'Average_Sentiment']
In [47]:
daily_sentiment.head()
Out[47]:
Year Month Day Average_Sentiment
0 2020 1 1 0.806833
1 2020 1 2 0.671939
2 2020 1 3 0.717555
3 2020 1 4 0.771158
4 2020 1 5 0.852698
In [48]:
daily_sentiment['Month_Day'] = daily_sentiment['Month'].astype(str).str.zfill(2) + '-' + daily_sentiment['Day'].astype(str).str.zfill(2)
In [49]:
daily_sentiment.head()
Out[49]:
Year Month Day Average_Sentiment Month_Day
0 2020 1 1 0.806833 01-01
1 2020 1 2 0.671939 01-02
2 2020 1 3 0.717555 01-03
3 2020 1 4 0.771158 01-04
4 2020 1 5 0.852698 01-05
In [50]:
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Set the style to white (no grid)
sns.set(style="white")

# Create a line plot with a larger figure size
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment, x='Month_Day', y='Average_Sentiment', hue='Year', palette=custom_colors)

# Customize the plot
plt.title('Daily Average Sentiment Trend By Year', fontsize=16)
plt.xlabel('Month-Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)

# Improve x-tick readability
# Show every 10th Month-Day label to keep the axis readable
x_ticks = daily_sentiment['Month_Day'].unique()[::10]  # Adjust the step as needed
plt.xticks(x_ticks, rotation=90)  # Rotate x-ticks for better readability

# Place the legend outside the plot
plt.legend(title='Year', bbox_to_anchor=(1.01, 1.01), loc='upper left')

# Adjust subplot parameters for better layout
plt.subplots_adjust(right=0.8)

# Show the plot
plt.show()
[Figure: Daily Average Sentiment Trend By Year]
In [51]:
daily_sentiment2 = news.groupby('date')['vader_comp'].mean().reset_index()
daily_sentiment2.columns = ['Date', 'Average_Sentiment']
In [52]:
daily_sentiment2.head()
Out[52]:
Date Average_Sentiment
0 2020-01-01 0.806833
1 2020-01-02 0.671939
2 2020-01-03 0.717555
3 2020-01-04 0.771158
4 2020-01-05 0.852698
In [53]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment2, x='Date', y='Average_Sentiment')

# Customize the plot
plt.title('Daily Average Sentiment Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
x_ticks = daily_sentiment2['Date'].unique()[::30]  # Adjust the step as needed
plt.xticks(x_ticks, rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Daily Average Sentiment Over Time]
In [54]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news, x='day', y='vader_comp', marker='o')

# Customize the plot
plt.title('Daily Average Sentiment', fontsize=16)
plt.xlabel('Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Daily Average Sentiment]

3.2. Positive Sentiment (Average of Sentiment from Positive)

1. Sentiment Distribution

In [55]:
news_po = news[news['vader_sent'] == 'positive']
In [56]:
po_vader_comp = news_po['vader_comp']
In [57]:
po_vader_comp.describe()
Out[57]:
count    187561.000000
mean          0.981748
std           0.079434
min           0.001500
25%           0.994200
50%           0.997900
75%           0.999100
max           1.000000
Name: vader_comp, dtype: float64
In [58]:
# Plot a histogram with a KDE overlay (sns.distplot is deprecated in recent seaborn releases)
plt.figure(figsize=(7, 5))  # Set the size of the plot
sns.histplot(po_vader_comp, bins=30, kde=True, stat='density', edgecolor='black')

# Customize the plot
plt.title('Distribution of VADER Compound Scores from Positive Sentiment')
plt.xlabel('Compound Score')
plt.ylabel('Density')

plt.show()
[Figure: Distribution of VADER Compound Scores from Positive Sentiment]

2. Sentiment Over Time

Year

In [59]:
# Group by year and calculate the average sentiment score for each year
yearly_sentiment = news_po.groupby('year')['vader_comp'].mean().reset_index()
yearly_sentiment.columns = ['Year', 'Average_Sentiment']
In [60]:
yearly_sentiment.head()
Out[60]:
Year Average_Sentiment
0 2020 0.979042
1 2021 0.986038
2 2022 0.984295
3 2023 0.980291
In [61]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=yearly_sentiment, x='Year', y='Average_Sentiment', marker='o')

# Customize the plot
plt.title('Yearly Average Sentiment Trend from Positive Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(yearly_sentiment['Year'])  # Ensure all years are shown as x-ticks

# Show the plot
plt.show()
[Figure: Yearly Average Sentiment Trend from Positive Sentiment]
In [62]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_po, x='year', y='vader_comp', marker='o')

# Customize the plot
plt.title('Yearly Average Sentiment Over Time from Positive Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Yearly Average Sentiment Over Time from Positive Sentiment]

Month

In [63]:
# Group by year and month, and calculate the average sentiment score for each month
monthly_sentiment = news_po.groupby(['year', 'month'])['vader_comp'].mean().reset_index()
monthly_sentiment.columns = ['Year', 'Month', 'Average_Sentiment']
In [64]:
monthly_sentiment.head()
Out[64]:
Year Month Average_Sentiment
0 2020 1 0.981678
1 2020 2 0.977027
2 2020 3 0.985241
3 2020 4 0.986130
4 2020 5 0.986621
In [65]:
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Create a larger figure size to prevent overlapping
plt.figure(figsize=(20, 10))

# Create a line plot with the custom color palette
sns.lineplot(x='Month', y='Average_Sentiment', hue='Year', data=monthly_sentiment, marker='o', palette=custom_colors)

# Add titles and labels
plt.title('Monthly Average Sentiment by Year from Positive Sentiment', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])  # Month labels from 1 to 12

# Move the legend outside of the plot
plt.legend(title='Year', bbox_to_anchor=(1.02, 1.02), loc='upper left')

# Adjust subplot parameters for better layout
# plt.subplots_adjust(right=0.8)

# Show the plot
plt.show()
[Figure: Monthly Average Sentiment by Year from Positive Sentiment]
In [66]:
monthly_sentiment['Year_Month'] = monthly_sentiment['Year'].astype(str).str.zfill(2) + '-' + monthly_sentiment['Month'].astype(str).str.zfill(2)
In [67]:
monthly_sentiment.head()
Out[67]:
Year Month Average_Sentiment Year_Month
0 2020 1 0.981678 2020-01
1 2020 2 0.977027 2020-02
2 2020 3 0.985241 2020-03
3 2020 4 0.986130 2020-04
4 2020 5 0.986621 2020-05
In [68]:
# Convert 'Year_Month' to a datetime format for better plotting
monthly_sentiment['Year_Month'] = pd.to_datetime(monthly_sentiment['Year_Month'])
In [69]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=monthly_sentiment, x='Year_Month', y='Average_Sentiment', marker='o')

# Customize the plot
plt.title('Monthly Average Sentiment Over Time from Positive Sentiment', fontsize=16)
plt.xlabel('Year-Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Monthly Average Sentiment Over Time from Positive Sentiment]
In [70]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_po, x='month', y='vader_comp', marker='o')

# Customize the plot
plt.title('Monthly Average Sentiment from Positive Sentiment', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Monthly Average Sentiment from Positive Sentiment]

Day

In [71]:
daily_sentiment = news_po.groupby(['year', 'month', 'day'])['vader_comp'].mean().reset_index()
daily_sentiment.columns = ['Year', 'Month', 'Day', 'Average_Sentiment']
In [72]:
daily_sentiment.head()
Out[72]:
Year Month Day Average_Sentiment
0 2020 1 1 0.970577
1 2020 1 2 0.945592
2 2020 1 3 0.980421
3 2020 1 4 0.979463
4 2020 1 5 0.973589
In [73]:
daily_sentiment['Month_Day'] = daily_sentiment['Month'].astype(str).str.zfill(2) + '-' + daily_sentiment['Day'].astype(str).str.zfill(2)
In [74]:
daily_sentiment.head()
Out[74]:
Year Month Day Average_Sentiment Month_Day
0 2020 1 1 0.970577 01-01
1 2020 1 2 0.945592 01-02
2 2020 1 3 0.980421 01-03
3 2020 1 4 0.979463 01-04
4 2020 1 5 0.973589 01-05
In [75]:
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Set the style to white (no grid)
sns.set(style="white")

# Create a line plot with a larger figure size
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment, x='Month_Day', y='Average_Sentiment', hue='Year', palette=custom_colors)

# Customize the plot
plt.title('Daily Average Sentiment Trend by Year from Positive Sentiment', fontsize=16)
plt.xlabel('Month-Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)

# Improve x-tick readability
# Show every 10th Month-Day label to keep the axis readable
x_ticks = daily_sentiment['Month_Day'].unique()[::10]  # Adjust the step as needed
plt.xticks(x_ticks, rotation=90)  # Rotate x-ticks for better readability

# Place the legend outside the plot
plt.legend(title='Year', bbox_to_anchor=(1.01, 1.01), loc='upper left')

# Adjust subplot parameters for better layout
plt.subplots_adjust(right=0.8)

# Show the plot
plt.show()
[Figure: Daily Average Sentiment Trend by Year from Positive Sentiment]
In [76]:
daily_sentiment2 = news_po.groupby('date')['vader_comp'].mean().reset_index()
daily_sentiment2.columns = ['Date', 'Average_Sentiment']
In [77]:
daily_sentiment2.head()
Out[77]:
Date Average_Sentiment
0 2020-01-01 0.970577
1 2020-01-02 0.945592
2 2020-01-03 0.980421
3 2020-01-04 0.979463
4 2020-01-05 0.973589
In [78]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment2, x='Date', y='Average_Sentiment')

# Customize the plot
plt.title('Daily Average Sentiment Over Time from Positive Sentiment', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
x_ticks = daily_sentiment2['Date'].unique()[::30]  # Adjust the step as needed
plt.xticks(x_ticks, rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Daily Average Sentiment Over Time from Positive Sentiment]
In [79]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_po, x='day', y='vader_comp', marker='o')

# Customize the plot
plt.title('Daily Average Sentiment from Positive Sentiment', fontsize=16)
plt.xlabel('Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
[Figure: Daily Average Sentiment from Positive Sentiment]

3.3. Negative Sentiment (Average of Sentiment from Negative)

1. Sentiment Distribution

In [80]:
news_ne = news[news['vader_sent'] == 'negative']
In [81]:
ne_vader_comp = news_ne['vader_comp']
In [82]:
ne_vader_comp.describe()
Out[82]:
count    9947.000000
mean       -0.834087
std         0.242236
min        -1.000000
25%        -0.989600
50%        -0.955200
75%        -0.788400
max        -0.001800
Name: vader_comp, dtype: float64
In [83]:
# Plot a histogram with a KDE overlay (sns.distplot is deprecated in recent seaborn releases)
plt.figure(figsize=(7, 5))  # Set the size of the plot
sns.histplot(ne_vader_comp, bins=30, kde=True, stat='density', edgecolor='black')

# Customize the plot
plt.title('Distribution of VADER Compound Scores from Negative Sentiment')
plt.xlabel('Compound Score')
plt.ylabel('Density')

plt.show()
[Figure: Distribution of VADER Compound Scores from Negative Sentiment]

2. Sentiment Over Time¶

Year¶

In [84]:
# Group by year and calculate the average sentiment score for each year
yearly_sentiment = news_ne.groupby('year')['vader_comp'].mean().reset_index()
yearly_sentiment.columns = ['Year', 'Average_Sentiment']
In [85]:
yearly_sentiment.head()
Out[85]:
Year Average_Sentiment
0 2020 -0.819097
1 2021 -0.849497
2 2022 -0.848392
3 2023 -0.830405
In [86]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=yearly_sentiment, x='Year', y='Average_Sentiment', marker='o')

# Customize the plot
plt.title('Yearly Average Sentiment Trend from Negative Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(yearly_sentiment['Year'])  # Ensure all years are shown as x-ticks

# Show the plot
plt.show()
In [87]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_ne, x='year', y='vader_comp', marker='o')

# Customize the plot
plt.title('Yearly Average Sentiment Over Time from Negative Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()

Month¶

In [88]:
# Group by year and month, and calculate the average sentiment score for each month
monthly_sentiment = news_ne.groupby(['year', 'month'])['vader_comp'].mean().reset_index()
monthly_sentiment.columns = ['Year', 'Month', 'Average_Sentiment']
In [89]:
monthly_sentiment.head()
Out[89]:
Year Month Average_Sentiment
0 2020 1 -0.849019
1 2020 2 -0.802590
2 2020 3 -0.739436
3 2020 4 -0.774740
4 2020 5 -0.799248
In [90]:
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Create a larger figure size to prevent overlapping
plt.figure(figsize=(20, 10))

# Create a line plot with the custom color palette
sns.lineplot(x='Month', y='Average_Sentiment', hue='Year', data=monthly_sentiment, marker='o', palette=custom_colors)

# Add titles and labels
plt.title('Monthly Average Sentiment by Year from Negative Sentiment')
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])  # Month labels from 1 to 12

# Move the legend outside of the plot
plt.legend(title='Year', bbox_to_anchor=(1.02, 1.02), loc='upper left')

# Adjust subplot parameters for better layout
# plt.subplots_adjust(right=0.8)

# Show the plot
plt.show()
In [91]:
monthly_sentiment['Year_Month'] = monthly_sentiment['Year'].astype(str) + '-' + monthly_sentiment['Month'].astype(str).str.zfill(2)
In [92]:
monthly_sentiment.head()
Out[92]:
Year Month Average_Sentiment Year_Month
0 2020 1 -0.849019 2020-01
1 2020 2 -0.802590 2020-02
2 2020 3 -0.739436 2020-03
3 2020 4 -0.774740 2020-04
4 2020 5 -0.799248 2020-05
In [93]:
# Convert 'Year_Month' to a datetime format for better plotting
monthly_sentiment['Year_Month'] = pd.to_datetime(monthly_sentiment['Year_Month'])
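As an alternative to building a `'YYYY-MM'` string first, `pd.to_datetime` can assemble timestamps directly from year, month, and day components. The sketch below is illustrative only: `demo` is a small frame with rows copied from the `monthly_sentiment` output above, and pinning each month to `day=1` is an assumption.

```python
import pandas as pd

# Illustrative rows copied from the monthly_sentiment output above
demo = pd.DataFrame({
    'Year': [2020, 2020, 2020],
    'Month': [1, 2, 3],
    'Average_Sentiment': [-0.849019, -0.802590, -0.739436],
})

# pd.to_datetime accepts a frame with 'year', 'month', 'day' columns;
# day=1 (an assumption) pins every month to its first day
demo['Year_Month'] = pd.to_datetime(
    pd.DataFrame({'year': demo['Year'], 'month': demo['Month'], 'day': 1})
)
print(demo)
```

This avoids relying on string parsing and keeps the column as a proper datetime from the start.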
In [94]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=monthly_sentiment, x='Year_Month', y='Average_Sentiment', marker='o')

# Customize the plot
plt.title('Monthly Average Sentiment Over Time from Negative Sentiment', fontsize=16)
plt.xlabel('Year-Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
In [95]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_ne, x='month', y='vader_comp', marker='o')

# Customize the plot
plt.title('Monthly Average Sentiment from Negative Sentiment', fontsize=16)
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()

Day¶

In [96]:
daily_sentiment = news_ne.groupby(['year', 'month', 'day'])['vader_comp'].mean().reset_index()
daily_sentiment.columns = ['Year', 'Month', 'Day', 'Average_Sentiment']
In [97]:
daily_sentiment.head()
Out[97]:
Year Month Day Average_Sentiment
0 2020 1 1 -0.994350
1 2020 1 2 -0.895345
2 2020 1 3 -0.969167
3 2020 1 4 -0.634900
4 2020 1 5 -0.537550
In [98]:
daily_sentiment['Month_Day'] = daily_sentiment['Month'].astype(str).str.zfill(2) + '-' + daily_sentiment['Day'].astype(str).str.zfill(2)
In [99]:
daily_sentiment.head()
Out[99]:
Year Month Day Average_Sentiment Month_Day
0 2020 1 1 -0.994350 01-01
1 2020 1 2 -0.895345 01-02
2 2020 1 3 -0.969167 01-03
3 2020 1 4 -0.634900 01-04
4 2020 1 5 -0.537550 01-05
In [100]:
# Custom color palette
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Set the style to white (no grid)
sns.set(style="white")

# Create a line plot with a larger figure size
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment, x='Month_Day', y='Average_Sentiment', hue='Year', palette=custom_colors)

# Customize the plot
plt.title('Daily Average Sentiment Trend By Year from Negative Sentiment', fontsize=16)
plt.xlabel('Month-Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)

# Improve x-tick readability
# Label only every 10th Month-Day value to avoid overlapping ticks
x_ticks = daily_sentiment['Month_Day'].unique()[::10]  # Adjust the step as needed
plt.xticks(x_ticks, rotation=90)  # Rotate x-ticks for better readability

# Place the legend outside the plot
plt.legend(title='Year', bbox_to_anchor=(1.01, 1.01), loc='upper left')

# Adjust subplot parameters for better layout
plt.subplots_adjust(right=0.8)

# Show the plot
plt.show()
In [101]:
daily_sentiment2 = news_ne.groupby('date')['vader_comp'].mean().reset_index()
daily_sentiment2.columns = ['Date', 'Average_Sentiment']
In [102]:
daily_sentiment2.head()
Out[102]:
Date Average_Sentiment
0 2020-01-01 -0.994350
1 2020-01-02 -0.895345
2 2020-01-03 -0.969167
3 2020-01-04 -0.634900
4 2020-01-05 -0.537550
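Daily means for the negative subset can swing widely because some days contain only a few articles. One possible way to make the trend easier to read is a rolling mean over a calendar window. The sketch below is illustrative: `demo` holds the five rows shown above, the `Date` column is assumed to be convertible to datetime, and the 3-day window is chosen only to suit this tiny sample (a 7- or 30-day window may fit the full series better).

```python
import pandas as pd

# Illustrative rows copied from the daily_sentiment2 output above
demo = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03',
                            '2020-01-04', '2020-01-05']),
    'Average_Sentiment': [-0.994350, -0.895345, -0.969167, -0.634900, -0.537550],
})

# Time-based rolling mean: each value averages the current day
# and the preceding days that fall inside the 3-day window
smoothed = (
    demo.set_index('Date')['Average_Sentiment']
        .rolling('3D')
        .mean()
)
print(smoothed)
```

The smoothed series can be plotted with `sns.lineplot` in the same way as the raw daily averages.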
In [103]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(20, 10))
sns.lineplot(data=daily_sentiment2, x='Date', y='Average_Sentiment')

# Customize the plot
plt.title('Daily Average Sentiment Over Time from Negative Sentiment', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
x_ticks = daily_sentiment2['Date'].unique()[::30]  # Adjust the step as needed
plt.xticks(x_ticks, rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()
In [104]:
# Set the style
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=news_ne, x='day', y='vader_comp', marker='o')

# Customize the plot
plt.title('Daily Average Sentiment from Negative Sentiment', fontsize=16)
plt.xlabel('Day', fontsize=14)
plt.ylabel('Average Sentiment', fontsize=14)
# plt.xticks(rotation=90)  # Rotate x-ticks for better readability

# Show the plot
plt.show()

3-(B). Sentiment over time: Article Numbers¶

In [105]:
news.groupby('year')['vader_sent'].count()
Out[105]:
year
2020     22836
2021     28962
2022     36775
2023    109491
Name: vader_sent, dtype: int64
In [106]:
grouped_data_po = news_po.groupby('year')['vader_sent'].size().reset_index(name = 'count')
In [107]:
grouped_data_po.head()
Out[107]:
year count
0 2020 21413
1 2021 27591
2 2022 35325
3 2023 103232
In [108]:
# Set the style
sns.set(style="white")

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(data=grouped_data_po, x='year', y='count')

# Customize the plot
plt.title('News Article Count (Yearly) from Positive Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Count', fontsize=14)

# Show the plot
plt.show()
In [109]:
grouped_data_ne = news_ne.groupby('year')['vader_sent'].size().reset_index(name = 'count')
In [110]:
grouped_data_ne.head()
Out[110]:
year count
0 2020 1139
1 2021 1311
2 2022 1361
3 2023 6136
In [111]:
# Set the style
sns.set(style="white")

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(data=grouped_data_ne, x='year', y='count')

# Customize the plot
plt.title('News Article Count (Yearly) from Negative Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Count', fontsize=14)

# Show the plot
plt.show()
In [112]:
# Create a pivot table
pivot_data = news.pivot_table(index='year', columns='vader_sent', aggfunc='size', fill_value=0)
In [113]:
pivot_data.head()
Out[113]:
vader_sent negative neutral positive
year
2020 1139 284 21413
2021 1311 60 27591
2022 1361 89 35325
2023 6136 123 103232
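Because the 2023 article count is several times larger than in earlier years, raw counts make it hard to compare the sentiment mix across years. A minimal sketch of row-normalizing the pivot table into shares follows; `pivot_demo` is a hand-built copy of the counts shown above, not the notebook's `pivot_data` object.

```python
import pandas as pd

# Counts copied from the pivot table output above
pivot_demo = pd.DataFrame(
    {'negative': [1139, 1311, 1361, 6136],
     'neutral': [284, 60, 89, 123],
     'positive': [21413, 27591, 35325, 103232]},
    index=pd.Index([2020, 2021, 2022, 2023], name='year'),
)

# Divide each row by its total so the shares in every year sum to 1
share = pivot_demo.div(pivot_demo.sum(axis=1), axis=0)
print(share.round(4))
```

The same `div(..., axis=0)` call should work on `pivot_data` itself, assuming it has the layout shown in the output.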
In [114]:
sns.set(style="white")

# Create a line plot
plt.figure(figsize=(10, 5))
sns.lineplot(data=pivot_data, markers=True, dashes=False)

# Customize the plot
plt.title('Yearly News Article Count by Sentiment', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Count', fontsize=14)

# Place the legend outside of the plot to the right
plt.legend(title='Sentiment', loc='upper left', bbox_to_anchor=(1.01, 1.02))

# Adjust subplot parameters to fit the legend
plt.subplots_adjust(right=0.75)

# Show the plot
plt.show()
In [115]:
# Combine year and month into a single column
news['year_month'] = news['year'].astype(str) + '-' + news['month'].astype(str).str.zfill(2)

# Group by year_month and vader_sent, and count the occurrences
grouped_data = news.groupby(['year_month', 'vader_sent']).size().reset_index(name='count')
In [116]:
# Set the style
sns.set(style="white")

# Define the sentiments
sentiments = ['positive', 'negative', 'neutral']

# Create separate plots for each sentiment
for sentiment in sentiments:
    # Filter data for the current sentiment
    data_filtered = grouped_data[grouped_data['vader_sent'] == sentiment]

    # Create a bar plot for the current sentiment
    plt.figure(figsize=(10, 5))
    barplot = sns.barplot(data=data_filtered, x='year_month', y='count')

    # Customize the plot
    plt.title(f'Monthly Article Count ({sentiment.capitalize()} Sentiment)', fontsize=16)
    plt.xlabel('Year-Month', fontsize=14)
    plt.ylabel('Article Count', fontsize=14)

    # Rotate x-tick labels and show only every fifth one
    xtick_labels = barplot.get_xticklabels()
    skim_factor = 5  # Adjust this value as needed to skip x-ticks
    barplot.set_xticklabels([label.get_text() if i % skim_factor == 0 else '' for i, label in enumerate(xtick_labels)], rotation=90)

    # Place the legend outside of the plot
    # plt.legend(title='Sentiment', bbox_to_anchor=(1.01, 1.02), loc='upper left')

    # Show the plot
    plt.show()
In [117]:
# Pivot the data for stacked bar plot
pivot_data = grouped_data.pivot(index='year_month', columns='vader_sent', values='count').fillna(0)
In [118]:
# One color per sentiment column (negative, neutral, positive)
custom_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]

# Create a stacked bar plot; figsize is passed to DataFrame.plot, which creates its own figure
# (a separate plt.figure call would only leave an empty figure behind)
ax = pivot_data.plot(kind='bar', stacked=True, color=custom_colors, figsize=(20, 10))

# Customize the plot
plt.title('Monthly Total Article Count with Sentiment Portions', fontsize=16)
plt.xlabel('Year-Month', fontsize=14)
plt.ylabel('Total Article Count', fontsize=14)

# Rotate x-tick labels and show only every fifth one
plt.xticks(rotation=90)
xtick_labels = ax.get_xticklabels()
skim_factor = 5  # Adjust this value as needed
ax.set_xticklabels([label.get_text() if i % skim_factor == 0 else '' for i, label in enumerate(xtick_labels)])

# Place the legend outside of the plot
plt.legend(title='Sentiment', bbox_to_anchor=(1.01, 1.02), loc='upper left')

# Show the plot
plt.show()

4. Word Count¶

4.1. Original Data¶

In [119]:
# Set the style
sns.set(style="white")

# Create a box plot
plt.figure(figsize=(18, 8))
sns.boxplot(data=news, x='vader_sent', y='text_word_count')

# Customize the plot
plt.title('Word Count Distribution by Sentiment', fontsize=16)
plt.xlabel('Sentiment', fontsize=14)
plt.ylabel('Word Count', fontsize=14)

# Show the plot
plt.show()
In [120]:
# Create a violin plot
plt.figure(figsize=(12, 8))
sns.violinplot(data=news, x='vader_sent', y='text_word_count')

# Customize the plot
plt.title('Word Count Distribution by Sentiment', fontsize=16)
plt.xlabel('Sentiment', fontsize=14)
plt.ylabel('Word Count', fontsize=14)

# Show the plot
plt.show()
In [121]:
# Overlay word-count histograms, one per sentiment category
plt.figure(figsize=(10, 6))

# Map each sentiment to a fixed color so the colors do not depend on row order in news
color_map = {'positive': 'green', 'negative': 'red', 'neutral': 'gray'}

for sentiment, color in color_map.items():
    data = news[news['vader_sent'] == sentiment]  # Filter data for each sentiment category
    sns.histplot(data=data, x='text_word_count', label=sentiment, color=color, bins=30, stat='density', element='step')

plt.xlabel('Text Word Count')
plt.ylabel('Density')
plt.title('Distribution of Text Word Count by Sentiment')
plt.legend(title='Sentiment')
plt.show()

4.2. Data without Outliers¶

In [122]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=news, x='vader_sent', y='text_word_count', showfliers=False)
plt.title('Word Count Distribution by Sentiment (Without Outliers)')
plt.xlabel('Sentiment')
plt.ylabel('Word Count')
plt.show()
In [123]:
plt.figure(figsize=(12, 8))
sns.violinplot(data=news, x='vader_sent', y='text_word_count', cut=0)
plt.title('Word Count Distribution by Sentiment (Violin Plot)')
plt.xlabel('Sentiment')
plt.ylabel('Word Count')
plt.show()
In [124]:
news[news['vader_sent'] == 'positive']['text_word_count'].describe()
Out[124]:
count    187561.000000
mean        811.295456
std         611.917261
min           4.000000
25%         488.000000
50%         672.000000
75%         984.000000
max       29325.000000
Name: text_word_count, dtype: float64
In [125]:
news[news['vader_sent'] == 'negative']['text_word_count'].describe()
Out[125]:
count     9947.000000
mean       754.631447
std        528.337370
min          5.000000
25%        421.000000
50%        628.000000
75%        976.000000
max      10490.000000
Name: text_word_count, dtype: float64
In [126]:
news[news['vader_sent'] == 'neutral']['text_word_count'].describe()
Out[126]:
count     556.000000
mean       57.557554
std       166.231279
min         3.000000
25%        10.000000
50%        12.000000
75%        16.000000
max      1402.000000
Name: text_word_count, dtype: float64
In [127]:
def calculate_outlier_thresholds(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound
In [128]:
# Calculate thresholds for each sentiment category
positive_thresholds = calculate_outlier_thresholds(news[news['vader_sent'] == 'positive']['text_word_count'])
negative_thresholds = calculate_outlier_thresholds(news[news['vader_sent'] == 'negative']['text_word_count'])
neutral_thresholds = calculate_outlier_thresholds(news[news['vader_sent'] == 'neutral']['text_word_count'])
In [129]:
print(positive_thresholds)
print(negative_thresholds)
print(neutral_thresholds)
(-256.0, 1728.0)
(-411.5, 1808.5)
(1.0, 25.0)
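The thresholds above are computed but not applied to the data. One way they could be used is to keep only rows that fall inside the bounds of their own sentiment group. The sketch below is an assumption about how that might look, not the notebook's method: `iqr_bounds` mirrors `calculate_outlier_thresholds`, `drop_group_outliers` is a hypothetical helper, and `toy` is made-up data rather than the news dataset.

```python
import pandas as pd

def iqr_bounds(series):
    # Same 1.5 * IQR rule as calculate_outlier_thresholds above
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def drop_group_outliers(df, group_col, value_col):
    # Hypothetical helper: keep rows whose value lies within their own group's bounds
    def within(s):
        low, high = iqr_bounds(s)
        return s.between(low, high)
    mask = df.groupby(group_col)[value_col].transform(within).astype(bool)
    return df[mask]

# Made-up toy data, not taken from the news dataset
toy = pd.DataFrame({
    'vader_sent': ['positive'] * 5 + ['neutral'] * 5,
    'text_word_count': [500, 600, 700, 800, 29325, 10, 11, 12, 13, 1402],
})
filtered = drop_group_outliers(toy, 'vader_sent', 'text_word_count')
print(filtered)
```

Filtering per group matters here because the neutral articles are far shorter than the others, so a single global cutoff would treat them very differently.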
In [130]:
# Set the style
sns.set(style="white")

# Define the figure size
plt.figure(figsize=(10, 6))

# Filter out text_word_count values exceeding 2000
filtered_news = news[news['text_word_count'] <= 2000]

# Map each sentiment to a fixed color so the colors do not depend on row order
color_map = {'positive': 'green', 'negative': 'red', 'neutral': 'gray'}

# Plot overlapping histograms for each sentiment category
for sentiment, color in color_map.items():
    # Filter data for each sentiment category
    data = filtered_news[filtered_news['vader_sent'] == sentiment]['text_word_count']
    sns.histplot(data, label=sentiment, color=color, element='step', stat='count', common_norm=False, binwidth=50)

# Customize the plot
plt.xlabel('Text Word Count')
plt.ylabel('Article Count')
plt.title('Distribution of Text Word Count by Sentiment (Word Count ≤ 2000)')

# Place the legend outside the plot
plt.legend(title='Sentiment', bbox_to_anchor=(1.01, 1.02), loc='upper left')

# Show the plot
plt.tight_layout()  # Adjust the layout
plt.show()

5. Word Cloud¶

In [131]:
from wordcloud import WordCloud
In [132]:
# Function to generate word cloud
def generate_wordcloud(text, title):
    wordcloud = WordCloud(width = 800, height = 400, background_color ='white').generate(text)
    plt.figure(figsize = (10, 5), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    plt.title(title, fontsize=20)
    plt.show()
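Word clouds give a visual impression of term frequency, but exact counts are hard to read from them. A minimal sketch of a numeric complement using `collections.Counter` follows. `top_terms` is a hypothetical helper, the whitespace tokenization is an assumption about how `text_lemm` is formatted, and `sample` is made-up text standing in for that column.

```python
from collections import Counter

def top_terms(texts, n=10):
    # Count whitespace-separated tokens across all documents
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)

# Made-up lemmatized snippets standing in for news['text_lemm']
sample = ['ai model improve health care', 'ai risk job loss', 'model risk']
print(top_terms(sample, n=3))
```

On the real data, the function could be called on `news[news['vader_sent'] == 'negative']['text_lemm']` to list the most frequent terms for one sentiment.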
In [133]:
# Replace 'positive' with 'negative' or 'neutral' as needed
sentiment_text = " ".join(text for text in news[news['vader_sent'] == 'positive']['text_lemm'])

# Generate and plot the word cloud
wordcloud_sentiment = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(sentiment_text)

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_sentiment, interpolation='bilinear')
plt.title('Positive Sentiment Word Cloud', fontsize=20)
plt.axis('off')
plt.show()
In [134]:
sentiment_text = " ".join(text for text in news[news['vader_sent'] == 'negative']['text_lemm'])

# Generate and plot the word cloud
wordcloud_sentiment = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(sentiment_text)

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_sentiment, interpolation='bilinear')
plt.title('Negative Sentiment Word Cloud', fontsize=20)
plt.axis('off')
plt.show()
In [135]:
sentiment_text = " ".join(text for text in news[news['vader_sent'] == 'neutral']['text_lemm'])

# Generate and plot the word cloud
wordcloud_sentiment = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(sentiment_text)

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_sentiment, interpolation='bilinear')
plt.title('Neutral Sentiment Word Cloud', fontsize=20)
plt.axis('off')
plt.show()